Power Outage Analysis¶

Name(s): Zhongyan Luo

Website Link: https://github.com/Zhongyan0721/Power-Outage-Analysis

In [447]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'
pd.options.plotting.backend = 'plotly'

from dsc80_utils import *

import warnings
warnings.filterwarnings('ignore')

Step 1: Introduction¶

Power outages have become an increasingly pressing concern due to their frequency and impact on modern society. Recent data indicates that the United States has experienced a significant rise in weather-related power outages during warmer months, with a 60% increase over the past decade compared to the 2000s. This uptick is largely attributed to climate change, which intensifies storms, wildfires, and other extreme weather events, placing additional stress on aging power infrastructure. Beyond the immediate inconvenience, outages pose serious risks to public health and safety. For instance, individuals reliant on medical devices such as ventilators face life-threatening situations during prolonged outages. Economically, power outages can be devastating: businesses suffer operational disruptions, leading to substantial financial losses.

This dataset contains detailed information about power outages that occurred across states in the US. It includes data on the time and location of outages, environmental and demographic factors, the impacted population, etc. Some key features include climate-related anomalies, population density, land and water percentage, and other attributes that may influence power outages. The dataset consists of 1,534 records and 56 columns.

By researching the dataset, we can gain some insight into the following questions:

  • How have power outages changed over the years? Are they becoming more frequent or severe?
  • Which states or regions experience the most outages? Is there a pattern based on location?
  • How do climate anomalies (e.g., cold vs. warm) correlate with the frequency and duration of outages?
  • Are there specific months or seasons when outages are more likely to occur?

In the predictive analysis of this project, I will predict the population impacted by a power outage based on other related factors.

In [448]:
file_path = 'data/outage.xlsx'
df = pd.read_excel(file_path)

Step 2: Data Cleaning and Exploratory Data Analysis¶

In [449]:
# Coerce numeric columns; non-numeric entries become NaN.
numeric_columns = ['MONTH', 'ANOMALY.LEVEL', 'OUTAGE.DURATION', 'DEMAND.LOSS.MW',
                   'CUSTOMERS.AFFECTED', 'RES.PRICE', 'COM.PRICE', 'IND.PRICE',
                   'TOTAL.PRICE', 'RES.SALES', 'COM.SALES', 'IND.SALES', 'TOTAL.SALES',
                   'RES.PERCEN', 'COM.PERCEN', 'IND.PERCEN', 'POPDEN_UC', 'POPDEN_RURAL']

df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Treat empty strings in categorical columns as missing values.
categorical_columns = ['CLIMATE.REGION', 'CLIMATE.CATEGORY', 'CAUSE.CATEGORY', 'CAUSE.CATEGORY.DETAIL', 'HURRICANE.NAMES']
df[categorical_columns] = df[categorical_columns].replace("", pd.NA)

# Parse date columns; unparseable values become NaT.
datetime_columns = ['OUTAGE.START.DATE', 'OUTAGE.RESTORATION.DATE']
for col in datetime_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')
In [450]:
cause_counts = df["CAUSE.CATEGORY"].value_counts().reset_index()
cause_counts.columns = ["CAUSE.CATEGORY", "Count"]

fig_causes = px.bar(
    cause_counts,
    x="Count",
    y="CAUSE.CATEGORY",
    orientation="h",
    title="Most Common Causes of Major Outages",
    labels={"CAUSE.CATEGORY": "Cause Category"}
)
fig_causes.update_layout(xaxis_title="Count")
fig_causes.show()

fig2 = px.histogram(df, x="CUSTOMERS.AFFECTED", nbins=50, title="Distribution of Customers Affected",
                    labels={"CUSTOMERS.AFFECTED": "Customers Affected"}, log_y=True)

fig2.show()
In [451]:
year_counts = df["YEAR"].value_counts().reset_index()
year_counts.columns = ["YEAR", "Count"]

fig_years = px.bar(
    year_counts,
    x="YEAR",
    y="Count",
    title="Trend of Power Outages Over the Years",
    labels={"YEAR": "Year", "Count": "Number of Outages"}
)
fig_years.update_layout(xaxis=dict(dtick=1))
fig_years.show()


fig_pop = px.box(
    df,
    x="CAUSE.CATEGORY",
    y="OUTAGE.DURATION",
    title="Outage Duration by Cause Category",
    labels={"CAUSE.CATEGORY": "Causes", "OUTAGE.DURATION": "Duration"},
    range_y=[0, 30000]
)
fig_pop.show()
In [452]:
agg_table = df.groupby(['MONTH', 'ANOMALY.LEVEL']).agg({
    'OBS': 'count', 
    'PCT_LAND': 'mean', 
    'PCT_WATER_TOT': 'mean', 
}).reset_index()

agg_table
Out[452]:
MONTH ANOMALY.LEVEL OBS PCT_LAND PCT_WATER_TOT
0 1.0 -1.6 2 93.88 6.12
1 1.0 -1.4 6 86.91 13.09
2 1.0 -1.3 14 83.78 16.22
... ... ... ... ... ...
130 12.0 1.1 6 95.80 4.20
131 12.0 1.3 6 88.32 11.68
132 12.0 2.3 15 94.08 5.92

133 rows × 5 columns

Step 3: Assessment of Missingness¶

In [453]:
# Which causes are most associated with missing CUSTOMERS.AFFECTED?
df[df['CUSTOMERS.AFFECTED'].isna()]['CAUSE.CATEGORY'].value_counts()

# Indicator for whether CUSTOMERS.AFFECTED is missing (1 = missing).
df["MISSING_CUSTOMERS_AFFECTED"] = df["CUSTOMERS.AFFECTED"].isnull().astype(int)

def permutation_test(df, target_col, test_col, num_permutations=1000):
    # Observed statistic: absolute difference between the last two group means
    # of the missingness indicator (note: with more than two groups, only the
    # last pair of groups is compared).
    observed_diff = abs(df.groupby(test_col)[target_col].mean().diff().iloc[-1])

    # Null distribution: shuffle the indicator and recompute the statistic.
    # Keep the original index so the shuffled values align with the group keys.
    perm_diffs = []
    for _ in range(num_permutations):
        shuffled = np.random.permutation(df[target_col])
        perm_diff = abs(pd.Series(shuffled, index=df.index).groupby(df[test_col]).mean().diff().iloc[-1])
        perm_diffs.append(perm_diff)

    p_value = np.mean(np.array(perm_diffs) >= observed_diff)
    return p_value

cols = ["CLIMATE.CATEGORY", "CAUSE.CATEGORY", "RES.PRICE", "TOTAL.PRICE"]

results = {}
for col in cols:
    # Drop rows where the tested column itself is missing before testing.
    if df[col].isnull().sum() == 0:
        results[col] = permutation_test(df, "MISSING_CUSTOMERS_AFFECTED", col)
    else:
        results[col] = permutation_test(df.dropna(subset=[col]), "MISSING_CUSTOMERS_AFFECTED", col)

results
Out[453]:
{'CLIMATE.CATEGORY': np.float64(0.168),
 'CAUSE.CATEGORY': np.float64(0.0),
 'RES.PRICE': np.float64(1.0),
 'TOTAL.PRICE': np.float64(1.0)}

The missingness of "CUSTOMERS.AFFECTED" depends on "CAUSE.CATEGORY" (p-value = 0.0). This suggests that the likelihood of missing values in "CUSTOMERS.AFFECTED" is influenced by the cause of the outage. Certain outage causes may not require or report the number of affected customers, leading to systematic missingness.

The missingness of "CUSTOMERS.AFFECTED" does not depend on "RES.PRICE" (p-value = 1.0) or "TOTAL.PRICE" (p-value = 1.0). This indicates that the likelihood of missing values in "CUSTOMERS.AFFECTED" is independent of electricity prices, meaning the missingness is likely unrelated to economic factors.
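The shuffling logic behind these tests can be illustrated on synthetic data. The sketch below is only an illustration, not the outage dataset: the `cause` and `is_missing` names and the two missingness rates (60% vs. 10%) are made up. It compares the missingness rate between two groups and shuffles the indicator to build a null distribution, mirroring the test above.

```python
import numpy as np
import pandas as pd

# Synthetic data: missingness is far more common for one cause (MAR by design).
rng = np.random.default_rng(0)
n = 1000
cause = rng.choice(['weather', 'attack'], size=n)
missing = np.where(cause == 'attack',
                   rng.random(n) < 0.6,   # ~60% missing for 'attack'
                   rng.random(n) < 0.1)   # ~10% missing for 'weather'
toy = pd.DataFrame({'cause': cause, 'is_missing': missing.astype(int)})

# Observed statistic: absolute difference in missingness rates between groups.
observed = toy.groupby('cause')['is_missing'].mean().diff().abs().iloc[-1]

# Null distribution: shuffle the indicator, recompute the statistic.
diffs = []
for _ in range(1000):
    shuffled = np.random.permutation(toy['is_missing'])
    diffs.append(pd.Series(shuffled).groupby(toy['cause'].to_numpy())
                   .mean().diff().abs().iloc[-1])

p_value = np.mean(np.array(diffs) >= observed)
```

Because the two groups' missingness rates differ sharply by construction, the observed difference lands far in the tail of the null distribution and the p-value is essentially zero, the same pattern seen for "CAUSE.CATEGORY" above.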

Step 4: Hypothesis Testing¶

In [454]:
# Permutation test: do severe-weather outages affect more customers,
# on average, than intentional attacks?
tested_dataset = df[df['CAUSE.CATEGORY'].isin(['severe weather', 'intentional attack'])][['CUSTOMERS.AFFECTED', 'CAUSE.CATEGORY']]

n_repetitions = 1000

differences = []
for _ in range(n_repetitions):
    # Shuffle the cause labels to simulate the null hypothesis of no difference.
    with_shuffled = tested_dataset.assign(Shuffled=np.random.permutation(tested_dataset['CAUSE.CATEGORY']))

    group_means = (
        with_shuffled
        .groupby('Shuffled')['CUSTOMERS.AFFECTED']
        .mean()
    )
    differences.append(group_means.loc['severe weather'] - group_means.loc['intentional attack'])

# Observed difference in group means, compared against the null distribution.
observed = tested_dataset.groupby('CAUSE.CATEGORY')['CUSTOMERS.AFFECTED'].mean()
observed_diff = observed.loc['severe weather'] - observed.loc['intentional attack']
np.mean(np.array(differences) > observed_diff)
Out[454]:
np.float64(0.0)

Step 5: Framing a Prediction Problem¶

Predicting the population affected by power outages is crucial for mitigating risks, improving response strategies, and ensuring public safety. As climate change and infrastructure challenges increase the frequency and severity of outages, understanding which populations are most vulnerable allows for better preparedness and resource allocation.

In [455]:
df['CAUSE.CATEGORY'].value_counts()
Out[455]:
CAUSE.CATEGORY
severe weather                   763
intentional attack               418
system operability disruption    127
public appeal                     69
equipment failure                 60
fuel supply emergency             51
islanding                         46
Name: count, dtype: int64
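Since the models in Steps 6 and 7 classify `CAUSE.CATEGORY`, the class counts above imply a useful accuracy floor: a trivial model that always predicts `severe weather`, the most common cause, is already right about half the time. A quick sketch with the counts copied from the output above:

```python
# Class counts copied from the CAUSE.CATEGORY value_counts() output above.
counts = {
    'severe weather': 763,
    'intentional attack': 418,
    'system operability disruption': 127,
    'public appeal': 69,
    'equipment failure': 60,
    'fuel supply emergency': 51,
    'islanding': 46,
}

total = sum(counts.values())                      # 1534 labeled outages
majority_accuracy = max(counts.values()) / total  # always predict 'severe weather'
print(f"majority-class accuracy = {majority_accuracy:.3f}")  # 0.497
```

Any trained classifier should clear this roughly 0.50 floor to be worth keeping.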

Step 6: Baseline Model¶

In [456]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

drop_columns = [
    "OBS", "OUTAGE.START.DATE", "OUTAGE.START.TIME", "OUTAGE.RESTORATION.DATE",
    "OUTAGE.RESTORATION.TIME", "CAUSE.CATEGORY.DETAIL", "HURRICANE.NAMES"
]
base_df_cleaned = df.drop(columns=drop_columns)
features = ['YEAR', 'MONTH', 'CLIMATE.REGION', 'ANOMALY.LEVEL', 'CLIMATE.CATEGORY', 'RES.CUSTOMERS']
# Copy to avoid chained-assignment warnings when filling below.
base_df_cleaned = base_df_cleaned[features + ['CAUSE.CATEGORY']].copy()

# Impute numeric columns with the median and categorical columns with the mode.
for col in base_df_cleaned.select_dtypes(include=["float64", "int64"]).columns:
    base_df_cleaned[col] = base_df_cleaned[col].fillna(base_df_cleaned[col].median())

for col in base_df_cleaned.select_dtypes(include=["object"]).columns:
    base_df_cleaned[col] = base_df_cleaned[col].fillna(base_df_cleaned[col].mode()[0])

label_encoders = {}
for col in base_df_cleaned.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    base_df_cleaned[col] = le.fit_transform(base_df_cleaned[col])
    label_encoders[col] = le

X = base_df_cleaned.drop(columns=["CAUSE.CATEGORY"])
y = base_df_cleaned["CAUSE.CATEGORY"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
accuracy
Out[456]:
0.6579804560260586

Step 7: Final Model¶

In [457]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

features = ['YEAR', 'MONTH', 'CLIMATE.REGION', 'ANOMALY.LEVEL', 'CLIMATE.CATEGORY', 'RES.CUSTOMERS', 'CUSTOMERS.AFFECTED', 'PC.REALGSP.STATE', 'POPPCT_URBAN']
# Copy to avoid chained-assignment warnings when filling below.
df_cleaned = df[features + ['CAUSE.CATEGORY']].copy()

# Impute numeric columns with the median and categorical columns with the mode.
for col in df_cleaned.select_dtypes(include=["float64", "int64"]).columns:
    df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())

for col in df_cleaned.select_dtypes(include=["object"]).columns:
    df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode()[0])

# Label-encode the remaining categorical columns.
label_encoders = {}
for col in df_cleaned.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    df_cleaned[col] = le.fit_transform(df_cleaned[col])
    label_encoders[col] = le


X = df_cleaned.drop(columns=["CAUSE.CATEGORY"])
y = df_cleaned["CAUSE.CATEGORY"]

pipeline = Pipeline(steps=[
    ("classifier", RandomForestClassifier(random_state=42))
])

param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [10, 20, 30],
    "classifier__min_samples_split": [2, 5, 10],
    "classifier__min_samples_leaf": [1, 2, 4]
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

final_accuracy = accuracy_score(y_test, y_pred_best)
final_report = classification_report(y_test, y_pred_best)

final_accuracy
Out[457]:
0.7817589576547231

Step 8: Fairness Analysis¶

In [458]:
from sklearn.metrics import precision_score


# Split the test set by urbanization: rows at or above the median
# urban-population percentage vs. rows below it.
median_urban_population = X["POPPCT_URBAN"].median()
group_X = y_test[X_test["POPPCT_URBAN"] >= median_urban_population]  # more urban areas
group_Y = y_test[X_test["POPPCT_URBAN"] < median_urban_population]   # more rural areas

pred_X = best_model.predict(X_test[X_test["POPPCT_URBAN"] >= median_urban_population])
pred_Y = best_model.predict(X_test[X_test["POPPCT_URBAN"] < median_urban_population])

precision_X = precision_score(group_X, pred_X, average="weighted", zero_division=0)
precision_Y = precision_score(group_Y, pred_Y, average="weighted", zero_division=0)

# Permutation test: shuffle the true labels to build a null distribution of
# the precision difference between the two groups.
n_permutations = 1000
precision_diffs = []

for _ in range(n_permutations):
    shuffled_labels = np.random.permutation(y_test.values)

    shuffled_X = shuffled_labels[X_test["POPPCT_URBAN"] >= median_urban_population]
    shuffled_Y = shuffled_labels[X_test["POPPCT_URBAN"] < median_urban_population]

    shuffled_precision_X = precision_score(shuffled_X, pred_X, average="weighted", zero_division=0)
    shuffled_precision_Y = precision_score(shuffled_Y, pred_Y, average="weighted", zero_division=0)

    precision_diffs.append(shuffled_precision_X - shuffled_precision_Y)

observed_diff = precision_X - precision_Y

p_value = np.mean(np.abs(precision_diffs) >= np.abs(observed_diff))

p_value
Out[458]:
np.float64(0.083)